Early Experience with Profiling and Optimizing Distributed Shared Cache Performance on Tilera’s Tile Processor
Abstract
This paper describes our experience with profiling and optimizing physical locality for the distributed shared cache (DSC) in Tilera’s Tile multicore processor. Our approach uses the Tile Processor’s hardware performance measurement counters (PMCs) to acquire page-level access pattern profiles. A key problem we address is imprecise PMC interrupts. Our profiling tools use binary analysis to correct for interrupt “skid,” thus pinpointing individual memory operations that incur remote DSC slice references and permitting us to sample their access patterns. We use our access pattern profiles to drive page homing optimizations for both heap and static data objects. Our experiments show we can improve physical locality for 5 out of 11 SPLASH2 benchmarks running on 32 cores, enabling 32.9%–77.9% of DSC references to target the local DSC slice. To our knowledge, this is the first work to demonstrate page homing optimizations on a real system.
Similar resources
Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring
Private caching and shared caching are the two conventional approaches to managing distributed L2 caches in current multicore processors. Unfortunately, neither shared caching nor private caching guarantees optimal performance under different workloads, especially when many processor cores and cache slices are provided on a switched network. This paper takes a very different approach from the ...
Parallelizing the ZSWEEP Algorithm for Distributed-Shared Memory Architectures
In this paper we describe a simple parallelization of the ZSWEEP algorithm for rendering unstructured volumetric grids on distributed-shared memory machines, and study its performance on three generations of SGI multiprocessors, including the new Origin 3000 series. The main idea of the ZSWEEP algorithm is very simple; it is based on sweeping the data with a plane parallel to the viewing plane,...
Parallelizing the ZSWEEP Algorithm for Distributed-Shared Memory Architectures (ST)
In this paper we describe a simple parallelization of the ZSWEEP algorithm for rendering unstructured volumetric grids on distributed-shared memory machines, and study its performance on three generations of SGI multiprocessors, including the new Origin 3000 series. The main idea of the ZSWEEP algorithm is very simple; it is based on sweeping the data with a plane parallel to the viewing plane,...
A Case for Fine-Grain Adaptive Cache Coherence
As transistor density continues to grow geometrically, processor manufacturers are already able to place a hundred cores on a chip (e.g., Tilera TILE-Gx 100), with massive multicore chips on the horizon. Programmers now need to invest more effort in designing software capable of exploiting multicore parallelism. The shared memory paradigm provides a convenient layer of abstraction to the progra...
The Performance Value of Shared Network Caches in Clustered Multiprocessor Workstations
This paper evaluates the benefit of adding a shared cache to the network interface as a means of improving the performance of networked workstations configured as a distributed shared memory multiprocessor. A cache on the network interface, shared by all processors on each cluster, offers the potential benefits of retaining evicted processor cache lines, providing implicit prefetching when network cac...